Exploration of Different Imputation Methods for Missing Data

Author

Karthik Aerra, Liz Miller, Mohit Kumar Veeraboina, Robert Stairs

Published

August 4, 2024

Slides

Introduction

Missing data is a key challenge that data professionals encounter in their daily work. In the real world, data is often messy, incomplete, and requires a level of pre-processing before data analysis can occur. Part of data pre-processing is dealing with missing data. Missing data can occur for several reasons including but not limited to processing error, machine or measurement errors, non-responses in surveys, invalid calculations (e.g. division by zero), and study participant dropout (Emmanuel 2021). Missing data can be problematic for analysis because it can reduce the number of usable observations, introduce bias, and interfere with different analysis tools such as machine learning algorithms, which cannot automatically handle missing data.

One of the first steps in handling missing data is to determine how much of the data is missing and from which columns. Missing data in a dataset can be represented in several ways, including NA, NaN, Null, None, blanks, spaces, empty strings, or placeholder values such as 999999 (or an equivalent obvious numerical outlier). For standard missing values such as NAs, functions such as is.na() in R can be used to count the number of missing observations. For string-related missing values such as spaces (" ") or empty strings (""), functions such as unique() in R can be used to identify unexpected strings. For numerical placeholders such as 0, 99999 (or equivalent), or negative values, histograms of the data can be useful for identifying missing values.
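These checks can be sketched in base R (using the built-in airquality dataset for illustration; the blank-string vector is a made-up example):

```r
# Standard missing values: count NAs per column
colSums(is.na(airquality))

# String-related missing values: unique() exposes blanks and spaces
x <- c("a", "", " ", "b", NA)
unique(x)

# Recode blanks/spaces to NA before counting
x[!is.na(x) & trimws(x) == ""] <- NA
sum(is.na(x))  # 3

# Numeric placeholders: a histogram makes an obvious outlier like 99999 stand out
y <- c(rnorm(100, mean = 50, sd = 5), 99999)
hist(y)
y[y == 99999] <- NA
```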

If a dataset contains missing data, there are two basic approaches: deletion or imputation (Schafer 2002). Deletion of data can be problematic for analysis, especially when the proportion of missing data is high or the overall number of observations is low. Deletion can also lead to bias or underestimation of variance if the data are not missing completely at random. Imputation is the process of replacing missing values with estimates, which can be accomplished using statistical or machine learning methods. When choosing an approach to missing data, it is crucial that the underlying mechanism(s) for the missingness are understood.

Missing Data Mechanisms

The mechanisms for missing data can generally be broken up into three categories (Mack, Su, and Westreich 2018), (May et al. 2009), (Bennett 2001), (Little et al. 2014), (Du, Hu, and Zhang 2020):

Missing Completely at Random (MCAR):

When data are MCAR, the probability of a record being missing is independent of both observed and unobserved data. In other words, the likelihood of an observation having missing data is the same across all observations (Buuren 2018). The pattern of missing data is unrelated to the data itself. For example, if a continuous monitoring system periodically loses network connectivity or power, and that loss cannot be attributed to other factors (i.e., it is random), this missingness pattern can be considered MCAR. In the case of MCAR data, the deletion of data reduces the number of observations in the dataset but does not introduce bias. MCAR is the most ideal, but least realistic, scenario. The probability function for MCAR data can be described as follows:

Pr(R = 0 | Yobs, Ymis, ψ) = Pr(R = 0 | ψ)

Probability function for MCAR data (Buuren 2018)

Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data are not dependent on observed data or missing data.

Missing at Random (MAR):

When data are MAR, the probability of a record being missing depends on the observed data, but not the unobserved data (Buuren 2018). For example, in a survey with age as a variable, women may be more likely to omit their age. The probability of an observation having a missing data point for age can therefore be attributed to another observed variable. In the case of MAR, deletion of entire observations with missing data may or may not introduce bias. MAR is typically the underlying assumption for most imputation methods. The probability function for MAR data can be described as follows:

Pr(R = 0 | Yobs, Ymis, ψ) = Pr(R = 0 | Yobs, ψ)

Probability function for MAR data (Buuren 2018)

Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data are dependent on observed data but not missing data.

Missing Not at Random (MNAR):

When data are MNAR, the probability of a record being missing depends on the unobserved data. The probability of a missing data point is variable, but cannot be accounted for with other known, observed variables. There may be a factor that can predict the probability of a missing value, but it is unknown. MNAR data are the most complicated to analyze. As with MAR, deletion of observations with missing data may or may not introduce bias into the analysis. The probability function for MNAR data can be described as follows:

Pr(R = 0 | Yobs, Ymis, ψ) (does not simplify further)

Probability function for MNAR data (Buuren 2018)

Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data could be dependent on observed data or missing data.

Figure 1: MCAR, MAR, and MNAR Example, where M is the probability of a missing data point, X is the observed data, and Y is the unobserved data. (Du, Hu, and Zhang 2020)

Understanding the mechanism behind missing data (MCAR, MAR, or MNAR) is important because many imputation methods require the assumption of MCAR or MAR to hold true. Statistical tests such as Little's MCAR test, Hawkins' test, and a non-parametric test can be used to support whether data are MCAR or not. These tests will be discussed further in the methods section. Visualization of the missing data is also helpful for determining the relationships (or lack thereof) within the missing data. For example, correlation matrix plots, scatterplots, and other custom visualizations can be useful tools in identifying patterns in the missing data (Ghazali, Shaadan, and Idrus 2020).
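As one concrete option, the naniar package (used later in this analysis) provides mcar_test(), an implementation of Little's MCAR test; here it is shown on the built-in airquality dataset:

```r
library(naniar)

# H0: the data are MCAR. A small p-value is evidence against MCAR
# (suggesting MAR or MNAR); a large p-value fails to reject MCAR.
result <- mcar_test(airquality)
result  # chi-squared statistic, df, p-value, and number of missing-data patterns
```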

Techniques for Handling Missing Data

Methods for handling missing data can be split into three broad categories (Ren et al. 2023), (Emmanuel 2021):

Deletion methods:

One approach to handling missing data is to delete instances of missing values. This can be accomplished through either listwise deletion or feature selection, for example. In listwise deletion, all rows with missing values are deleted from the dataset. This results in a dataset with a lower number of observations. In feature selection, columns with a high proportion of missing data are deleted from the dataset. In this way, the number of observations in the dataset is maintained, but the total number of features is reduced. Deletion methods are advantageous in that they are quick and simple. With a large dataset and relatively low amounts of missing data, reducing the number of observations or features may be acceptable. However, deletion can introduce bias into the analysis if the data are not missing completely at random, which is often the case (Ren et al. 2023), (Emmanuel 2021).

Single imputation:

In single imputation, missing data values are replaced with a single value, without consideration of the uncertainty around the prediction for the missing value. In other words, a single value is predicted for each missing data point. This can be accomplished using techniques such as mean imputation, median imputation, hot-deck and cold-deck imputation, and regression-based techniques. Because no model of the column's full distribution is defined, information about the variance of the prediction is lost. Single imputation methods are simpler than multiple imputation methods, but they tend to underestimate the variance, so the resulting models may have artificially narrow confidence intervals (Ren et al. 2023), (Emmanuel 2021).
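A minimal base-R sketch of mean and median imputation; impute_single() is an illustrative helper, not a library function:

```r
# Replace NAs in a numeric vector with a single summary statistic
impute_single <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

v <- c(2, 4, NA, 10)
impute_single(v)          # NA replaced by the mean (5.33)
impute_single(v, median)  # NA replaced by the median (4)
```

Note that every missing entry in a column receives the same value, which is why single imputation shrinks the estimated variance.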

Multiple imputation:

In multiple imputation methods such as MICE, a model is constructed to predict missing values in each column based on observed data in other columns. This model provides a point estimate for the missing values, as well as the uncertainty around the predictions. Multiple iterations are performed to predict the missing values. In each iteration, a missing value is predicted based on the model and its uncertainty. That is, with multiple iterations, you will end up with a distribution of possible missing values. In this way, multiple datasets with multiple possible values for the missing data are generated. Once the iterations are complete, the missing data predictions can be consolidated back into a single dataset, where the missing values are a result of the point estimate and the variance or uncertainty from the model. Multiple imputation can be a powerful technique that helps preserve some of the variance or uncertainty around missing values, but it can be complex and computationally expensive (Ren et al. 2023), (Emmanuel 2021).
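A small mice illustration on the built-in airquality dataset (the small m and maxit here are arbitrary choices for speed):

```r
library(mice)

# Generate m = 3 completed datasets over 5 iterations
# (predictive mean matching is the default method for numeric columns)
imp <- mice(airquality, m = 3, maxit = 5, seed = 1, printFlag = FALSE)

# Fit a model on each completed dataset, then pool with Rubin's rules
fits <- with(imp, lm(Ozone ~ Solar.R + Wind + Temp))
summary(pool(fits))

# Alternatively, extract a single completed dataset
completed <- complete(imp, 1)
sum(is.na(completed))  # 0
```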

Imputation methods can be further broken down into statistical versus machine learning methods. According to (Lin and Tsai 2020)'s review of literature from 2006-2017, four of the most common statistical methods for missing value imputation include expectation maximization (EM), linear regression, least squares (LS), and mean/mode imputation. For machine learning, the four most common methods include clustering, decision trees, random forest, and KNN.

Theories for Missing Data

Rubin’s Missing Data Theory:

This foundational theory categorizes missing data into MCAR, MAR, and MNAR. It provides a framework for understanding the mechanisms behind missing data and guides the selection of appropriate imputation strategies based on these mechanisms (Rubin 2004).

Statistical Theory of Imputation:

This theory involves using statistical models to estimate missing values, leveraging the relationship between observed and missing data points. Multiple imputation, a prominent technique within this theory, generates multiple plausible values for each missing data point to account for uncertainty (Schafer 1997).

Machine Learning Theories:

Machine learning techniques are increasingly employed for imputing missing data due to their ability to learn patterns from data and make predictions. Methods such as k-nearest neighbors (KNN) and neural networks are applied to predict missing values based on observed data patterns (Batista and Monard 2003).

These theories provide a comprehensive framework for understanding and implementing imputation methods across various datasets and analytical contexts.

Research Methods for Missing Data:

Descriptive Studies:

These studies analyze the patterns of missing data and evaluate the effectiveness of various imputation methods. Examples include the work by Jerez et al. and Güzel (Jerez 2010), (Guzel 2013).

Comparative Studies:

These studies compare the performance of different imputation methods. Joseph G. Ibrahim and Siming Zheng have conducted such comparative analyses, highlighting the strengths and weaknesses of various approaches (Ibrahim 2011), (Zhang 2024), (Zheng, Zhang, and Zhou 2023).

Simulation Studies:

These studies test imputation methods on controlled datasets, allowing researchers to systematically observe the impact of different techniques. This helps in understanding the conditions under which each method performs best.

Application-Based Studies:

These studies apply imputation methods to real-world data, demonstrating their practical use and the challenges involved. Examples include the works by Saqib Ejaz Awan and Emmanuel, which show how imputation methods are implemented in practical scenarios (Awan 2022), (Emmanuel 2021).

Theoretical Analysis:

Critical decisions based on statistical theories and practical techniques are involved in addressing missing data in datasets. By categorizing missing data into MCAR, MAR, and MNAR, Rubin's Missing Data Theory offers a fundamental framework for choosing appropriate imputation techniques. The Statistical Theory of Imputation estimates missing values using models such as multiple imputation, which preserves uncertainty by generating multiple plausible values. In contrast, machine learning theories use data-driven insights to predict missing values from observed patterns using algorithms such as KNN and neural networks. Many methods are available, with trade-offs in simplicity, accuracy, and processing needs; examples include deletion methods, single imputation (e.g., mean imputation), and multiple imputation (e.g., MICE). These techniques are validated by comparative and simulation studies, which assess their efficacy across different circumstances and datasets. Understanding these theories and approaches is essential for minimizing bias and maximizing data utility in analytical tasks, highlighting the multidisciplinary nature of tackling missing data problems in practical applications.

Project Purpose

The purpose of this project is to demonstrate the use of several different types of missing data imputation methods using a test dataset with missing data. The performance of various imputation methods will be assessed using the root mean squared error (RMSE) of the resulting predictive models after applying various imputation methods.

Methods (RS, EM)

An overview of the methodology used to demonstrate different techniques for imputing missing data is shown below:

Figure 2: Overview of Methodology for Demonstrating Different Techniques for Missing Data Imputation. Figure generated by Robert Stairs, 14JUL2024.

The data were first pre-processed. This step typically entails importing the data, checking for missing values, checking data types, and making transformations as needed. In this case, all columns were converted to numeric data type and renamed for easier interpretation. Missing values by column were reported in summary statistics, but were not addressed until further visualization was performed.

Following pre-processing, correlation coefficients were calculated for all variables with respect to Ozone levels (the response variable). Variables with strong positive or negative correlation were then visualized using histograms to show the distribution of each variable with respect to percentage of Ozone level readings. Missing data were then visualized to determine if there were any significant patterns to the missingness of the data. The purpose of this step is to provide evidence to support that the data are either MCAR, MAR, or MNAR.

Data were classified as either MCAR or not MCAR using Hawkins' test, a non-parametric test, and Little's MCAR test (Li 2013). Hawkins' test checks for both multivariate normality and homoscedasticity (equal variances), and therefore assumes normality. The non-parametric test checks for homoscedasticity without assuming normality. Lastly, Little's MCAR test is a chi-squared test that evaluates whether mean differences exist between groups with different missing data patterns. In these tests, the hypotheses are as follows:

  • Null Hypothesis (H0): The data is missing completely at random (MCAR).
  • Alternative Hypothesis (H1): The data is not missing completely at random.

Note, this procedure is only designed to test between MCAR/not MCAR and is not able to classify data as MAR or MNAR. Since the only true way to distinguish between MNAR and MAR is to obtain the actual missing values, we will rely on the MCAR tests and the following visualizations to best determine the pattern of missingness.

Visualizations used included the gg_miss_fct() function to plot the proportion of missing values for each column (y-axis) broken down by one variable at a time (x-axis); the vis_miss() function which provides a ggplot of the missingness inside a dataframe, colouring cells according to missingness, where black indicates a missing cell and grey indicates a present cell; the gg_miss_upset() function which shows the number of missing values by each possible combination of missing values (intersections); and the gg_miss_var() function which shows the number of missing values for each variable.

The purpose of this is to determine if there are discernible patterns in the missingness of the data. This provides additional supportive information for classifying the data as MCAR, MAR, or MNAR.

Once pre-processing was completed, the dataset was split into test and training datasets. Missing data were then imputed with various methodologies, including list-wise deletion, feature removal, mean imputation, mode imputation, median imputation, MICE, and missForest. This resulted in 7 unique datasets, where missing data were imputed using different methodologies.
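The split step can be sketched in base R (the built-in airquality data stands in for the ozone dataset, and the 70/30 proportion is an illustrative assumption):

```r
set.seed(42)
n <- nrow(airquality)

# Randomly assign 70% of rows to training, the rest to testing
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- airquality[train_idx, ]
test  <- airquality[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```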

Missing Value Deletion Methods

The following methods were used to delete missing data for this project:

List-wise Deletion: all rows containing missing values are removed from the dataset

Column Delete with Threshold (Feature Selection): if the total percentage of missing values in a column is over a threshold (20%), then the entire column/feature is removed. Mean imputation was used to fill in the remaining missing values.
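Both deletion methods can be sketched in base R; drop_then_impute() is an illustrative helper applying the 20% threshold described above, shown on the built-in airquality data:

```r
# List-wise deletion: remove every row that contains an NA
complete_rows <- na.omit(airquality)

# Feature selection: drop columns with > 20% missing, mean-impute the rest
drop_then_impute <- function(df, threshold = 0.20) {
  df <- df[, colMeans(is.na(df)) <= threshold, drop = FALSE]
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

cleaned <- drop_then_impute(airquality)  # Ozone (24% missing) is dropped
colSums(is.na(cleaned))                  # all zero
```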

Imputation Methods

The following methods were used to impute missing data for this project:

Mean imputation: missing values were replaced with the mean of the non-missing values of the corresponding feature.

Mode imputation: missing values were replaced with the mode of the non-missing values of the corresponding feature.

Median imputation: missing values were replaced with the median of the non-missing values of the corresponding feature.

MICE (multiple imputation by chained equations): missing data for a given column were imputed based on other columns with observed data over multiple iterations (Azur et al. 2011). It operates under the assumption that the data are missing at random (MAR). For a missing column, the missing values were imputed using regression based on observed data from other columns in the dataset. Then, the next column’s missing data were imputed using observed data in other columns and so forth until regression models are available for each column. Each regression model has its own level of uncertainty. With each iteration, a possible missing value is imputed for a given data point given the regression function and its associated uncertainty. This process was repeated over multiple iterations, where the number of iterations was set to 50. This resulted in multiple datasets, which were then combined into a single dataset again (Buuren 2018).

Figure 3: Multiple Imputation versus Single Imputation Methods, (Ren et al. 2023)

missForest: used a random forest trained on observed data to predict missing values in continuous and categorical data. It is an ensemble machine learning method in which the dataset is bootstrap sampled randomly and an individual decision tree is constructed for each bootstrap (Tang 2017). The collection of decision trees is then used to impute each missing data point. This ensemble approach helps prevent the over-fitting that individual decision trees are prone to.
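A minimal missForest sketch on the built-in airquality data (the reduced ntree is an arbitrary choice for speed; the package default is 100):

```r
library(missForest)

set.seed(1)
# Iteratively trains a random forest on observed values to predict the NAs
rf_imp <- missForest(airquality, ntree = 50)

imputed <- rf_imp$ximp   # the completed dataset
rf_imp$OOBerror          # out-of-bag estimate of imputation error (NRMSE)
sum(is.na(imputed))      # 0
```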

Machine Learning Methods for Model-Fitting

Once missing values were imputed using each of the seven methods outlined above, each of the seven resulting datasets was used to fit models for Ozone levels. The following models were then used for training on each dataset:

Decision Tree:

Decision trees are tree-structured models used for both classification and regression. They split the data into subsets based on the values of input features, forming a tree where each node represents a feature and each branch represents a decision rule. Advantages include: easy to interpret and visualize; can handle mixed-type data; non-parametric, making them flexible for various data distributions; require little data pre-processing. Disadvantages include: prone to over-fitting; sensitive to small variations in the data; can create biased trees if some classes/features are dominant.

Figure 4: Visual representation of decision tree algorithm (https://365datascience.com/tutorials/machine-learning-tutorials/decision-trees/)
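A regression-tree sketch using rpart (loaded above), predicting ozone from weather variables in the built-in airquality data:

```r
library(rpart)

# Regression tree ("anova" method); rpart handles NAs via surrogate splits
tree_fit <- rpart(Ozone ~ Solar.R + Wind + Temp,
                  data = airquality, method = "anova")

printcp(tree_fit)  # complexity table, useful when pruning to limit over-fitting

preds <- predict(tree_fit, airquality)  # in-sample predictions, for illustration
```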

Random Forest:

Random forest is an ensemble learning method primarily used for classification and regression. It combines the predictions of multiple decision trees to improve accuracy and control over-fitting. It constructs the decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Advantages include: high accuracy and robustness against over-fitting; works well with large datasets and high-dimensional spaces; handles both classification and regression tasks. Disadvantages include: can be computationally intensive and slower to predict; less interpretable compared to a single decision tree.

Figure 5: Visual representation of random forest algorithm (https://medium.com/@denizgunay/random-forest-af5bde5d7e1e)
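A randomForest regression sketch; note that randomForest() itself rejects NAs, which is one reason imputation (or deletion) must happen before model fitting:

```r
library(randomForest)

aq <- na.omit(airquality)  # randomForest() requires complete cases
set.seed(1)

# An ensemble of 200 regression trees, each grown on a bootstrap sample
rf_fit <- randomForest(Ozone ~ Solar.R + Wind + Temp, data = aq, ntree = 200)

importance(rf_fit)        # variable importance (IncNodePurity for regression)
preds <- predict(rf_fit)  # out-of-bag predictions on the training data
```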

K-Nearest Neighbors (KNN):

KNN is a simple, instance-based learning algorithm used for classification and regression. Rather than learning explicit parameters, it stores the training instances and classifies a new data point based on how its nearest neighbors are classified. Advantages include: simple and easy to implement; no separate training phase, making it more flexible. Disadvantages include: computationally expensive for large datasets due to the need to compute distances to all points; sensitive to irrelevant or redundant features; performance depends on the choice of k and the distance metric used.

Figure 6: Visual representation of KNN algorithm, where k is the number of nearest neighbors (https://towardsdatascience.com/why-does-increasing-k-decrease-variance-in-knn-9ed6de2f5061)
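A minimal KNN classification sketch with the class package (loaded above), shown on the built-in iris data; predictors are scaled first because KNN is distance-based:

```r
library(class)

set.seed(1)
x <- scale(iris[, 1:4])                  # scale features for fair distances
idx <- sample(seq_len(nrow(iris)), 100)  # 100 training rows, 50 test rows

# Classify each test point by majority vote among its k = 5 nearest neighbors
pred <- knn(train = x[idx, ], test = x[-idx, ],
            cl = iris$Species[idx], k = 5)

mean(pred == iris$Species[-idx])  # held-out accuracy
```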

The effectiveness of each imputation technique was evaluated using the RMSE of the resulting Ozone level models. The RMSE for each imputation/model combination was reported and summarized in a data frame. The combination with the lowest RMSE was determined to be the most effective imputation technique and model fit for this dataset.
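RMSE itself is simple to compute; a base-R helper:

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(3, 5, 8), c(2, 5, 10))  # sqrt((1 + 0 + 4) / 3), about 1.29
```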

Analysis and Results

Description of the Dataset

The Ozone dataset ("Ozone: Los Angeles Ozone Pollution Data, 1976") from the mlbench package in R was chosen for this project. It contains observations related to pollution levels in the Los Angeles area during 1976. The variables in the dataset are thought to influence ozone concentration. The daily measurements consist of:

  • V1: Month, 1-12, where 1 is January and 12 is December
  • V2: Day of month
  • V3: Day of week, 1-7, where 1 is Monday and 7 is Sunday
  • V4: Daily maximum one-hour-average ozone reading
  • V5: 500 millibar pressure height (m) measured at Vandenberg AFB
  • V6: Wind speed (mph) at Los Angeles International Airport (LAX)
  • V7: Humidity (%) at LAX
  • V8: Temperature (degrees F) measured at Sandburg, CA
  • V9: Temperature (degrees F) measured at El Monte, CA
  • V10: Inversion base height (feet) at LAX
  • V11: Pressure gradient (mm Hg) from LAX to Daggett, CA
  • V12: Inversion base temperature (degrees F) at LAX
  • V13: Visibility (miles) measured at LAX

The dataset allows for analysis of how different meteorological conditions/factors can affect ozone levels. This dataset was chosen because it already has a high frequency of missing data, which is representative of a real-world application for missing data. It contains a total of 13 variables, 366 observations (one for each day for one year), and 139 missing or NA values for V9, which represents 38% of the total observations in the dataset. In practical applications, the missing values are unknown (not simulated), so it is up to the investigator to choose a methodology for handling the missing data. The investigator will also need to choose appropriate metrics for evaluating the effectiveness of different methods for handling missing data.

Load Packages

Code
knitr::opts_chunk$set(echo = TRUE)
#install.packages('mlbench')

library(mlbench)
library(dplyr)
library(ggplot2)
library(vtable)
library(knitr)
library(kableExtra)
library(tidyverse)
library(tidyr)
library(naniar)
library(UpSetR)
library(corrplot)
library(gridExtra)
library(reshape2)
library(ggpubr)
library(caTools)
library(party)
library(magrittr)
library(ggfittext)

#additional libraries for model fitting and train/test split
library(caret) # for fitting KNN models
library(e1071)
library(rsample) # for creating validation splits
library(recipes)    # for feature engineering
library(randomForest)
library(rpart)# decision tree
library(tidymodels) 
library(class) 
library(vip) 

Importing and Summarizing the Data

First, the data were imported and converted to a data frame. Next, columns were reordered such that V4 (the response variable) is the first column. Lastly, the data were summarized using the summary() function, showing the number of missing values per variable. The number of missing or NA values in the dataset totaled 203 out of 4,758 data points (366 observations across 13 variables). The number of missing values (NA's) in each column are as follows:

  • V4: 5
  • V1: None
  • V2: None
  • V3: None
  • V5: 12
  • V6: None
  • V7: 15
  • V8: 2
  • V9: 139
  • V10: 15
  • V11: 1
  • V12: 14
  • V13: None

Import Data

Code
# import data
data("Ozone", package = "mlbench")

# convert to data frame
ozone1 <- data.frame(Ozone)
# Reorder columns
ozone1 <- ozone1 %>%
  dplyr::select(V4,V1,V2,V3,V5,V6,V7,V8,V9,V10,V11,V12,V13)

# View summary of data
summary(ozone1)
       V4              V1            V2      V3           V5      
 Min.   : 1.00   1      : 31   1      : 12   1:52   Min.   :5320  
 1st Qu.: 5.00   3      : 31   2      : 12   2:52   1st Qu.:5700  
 Median : 9.00   5      : 31   3      : 12   3:52   Median :5770  
 Mean   :11.53   7      : 31   4      : 12   4:53   Mean   :5753  
 3rd Qu.:16.00   8      : 31   5      : 12   5:53   3rd Qu.:5830  
 Max.   :38.00   10     : 31   6      : 12   6:52   Max.   :5950  
 NA's   :5       (Other):180   (Other):294   7:52   NA's   :12    
       V6               V7              V8              V9       
 Min.   : 0.000   Min.   :19.00   Min.   :25.00   Min.   :27.68  
 1st Qu.: 3.000   1st Qu.:49.00   1st Qu.:51.00   1st Qu.:49.73  
 Median : 5.000   Median :65.00   Median :62.00   Median :57.02  
 Mean   : 4.869   Mean   :58.48   Mean   :61.91   Mean   :56.85  
 3rd Qu.: 6.000   3rd Qu.:73.00   3rd Qu.:72.00   3rd Qu.:66.11  
 Max.   :11.000   Max.   :93.00   Max.   :93.00   Max.   :82.58  
                  NA's   :15      NA's   :2       NA's   :139    
      V10            V11             V12             V13       
 Min.   : 111   Min.   :-69.0   Min.   :27.50   Min.   :  0.0  
 1st Qu.: 890   1st Qu.:-10.0   1st Qu.:51.26   1st Qu.: 70.0  
 Median :2125   Median : 24.0   Median :62.24   Median :110.0  
 Mean   :2591   Mean   : 17.8   Mean   :60.93   Mean   :123.3  
 3rd Qu.:5000   3rd Qu.: 45.0   3rd Qu.:70.52   3rd Qu.:150.0  
 Max.   :5000   Max.   :107.0   Max.   :91.76   Max.   :500.0  
 NA's   :15     NA's   :1       NA's   :14                     

Data Pre-Processing

The data type was evaluated for each column, showing that V1, V2, and V3 (month, day of month, and day of week) were of "factor" data type, while the remaining columns were "numeric". Factor columns were then converted to numeric data type.

Check data types

Code
# Create function to check data types of data frame columns
check_data_types <- function(ozone1) {
  sapply(ozone1, class)
}

# Check data types of the columns
data_types <- check_data_types(ozone1)
print(data_types)
       V4        V1        V2        V3        V5        V6        V7        V8 
"numeric"  "factor"  "factor"  "factor" "numeric" "numeric" "numeric" "numeric" 
       V9       V10       V11       V12       V13 
"numeric" "numeric" "numeric" "numeric" "numeric" 
Code
# Function to convert specified columns from factor to numeric
convert_factors_to_numeric <- function(data, columns) {
  data[columns] <- lapply(data[columns], function(x) as.numeric(as.character(x)))
  return(data)
}

# Convert first 3 columns from factor to numeric
columns_to_convert <- c("V1", "V2", "V3")
ozone_date <- convert_factors_to_numeric(ozone1, columns_to_convert)
ozone2 <- convert_factors_to_numeric(ozone1, columns_to_convert)

# Check data types of the columns again
data_types <- sapply(ozone_date, class)
print(data_types)
       V4        V1        V2        V3        V5        V6        V7        V8 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       V9       V10       V11       V12       V13 
"numeric" "numeric" "numeric" "numeric" "numeric" 
Code
data_types2 <- sapply(ozone2, class)
print(data_types2)
       V4        V1        V2        V3        V5        V6        V7        V8 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       V9       V10       V11       V12       V13 
"numeric" "numeric" "numeric" "numeric" "numeric" 

A new column (Date) was created by combining the year (1976) with the month and day-of-month columns. This was done in order to run further analysis.

Code
# combine Day and Month to create a Date column
ozone_date <- ozone_date %>%
  mutate(Date = as.Date(paste(1976, V1, V2, sep = "-"), format = "%Y-%m-%d"))

# Verify the new Date column
head(ozone_date, n=10)
   V4 V1 V2 V3   V5 V6 V7 V8    V9  V10 V11   V12 V13       Date
1   3  1  1  4 5480  8 20 NA    NA 5000 -15 30.56 200 1976-01-01
2   3  1  2  5 5660  6 NA 38    NA   NA -14    NA 300 1976-01-02
3   3  1  3  6 5710  4 28 40    NA 2693 -25 47.66 250 1976-01-03
4   5  1  4  7 5700  3 37 45    NA  590 -24 55.04 100 1976-01-04
5   5  1  5  1 5760  3 51 54 45.32 1450  25 57.02  60 1976-01-05
6   6  1  6  2 5720  4 69 35 49.64 1568  15 53.78  60 1976-01-06
7   4  1  7  3 5790  6 19 45 46.40 2631 -33 54.14 100 1976-01-07
8   4  1  8  4 5790  3 25 55 52.70  554 -28 64.76 250 1976-01-08
9   6  1  9  5 5700  3 73 41 48.02 2083  23 52.52 120 1976-01-09
10  7  1 10  6 5700  3 59 44    NA 2654  -2 48.38 120 1976-01-10

Rename columns for clearer interpretation during analysis

Code
# rename columns
ozone2 <- plyr::rename(ozone1, c('V4'="Ozone_reading",
                                 'V1'="Month", 
                                 'V2'="Day_of_month",
                                 'V3'="Day_of_week", 
                                 'V5'="Pressure_afb", 
                                 'V6'="Wind_speed_LAX", 
                                 'V7'="Humidity_LAX", 
                                 'V8'="Temp_sandburg", 
                                 'V9'="Temp_EM", 
                                 'V10'="IBH_LAX", 
                                 'V11'="Pressure_gradient", 
                                 'V12'="IBT_LAX", 
                                 'V13'="Visibility_LAX"))

ozone_date2 <- plyr::rename(ozone_date, c('V4'="Ozone_reading",
                                 'V1'="Month", 
                                 'V2'="Day_of_month",
                                 'V3'="Day_of_week", 
                                 'V5'="Pressure_afb", 
                                 'V6'="Wind_speed_LAX", 
                                 'V7'="Humidity_LAX", 
                                 'V8'="Temp_sandburg", 
                                 'V9'="Temp_EM", 
                                 'V10'="IBH_LAX", 
                                 'V11'="Pressure_gradient", 
                                 'V12'="IBT_LAX", 
                                 'V13'="Visibility_LAX"))
head(ozone2, n=10)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1              3     1            1           4         5480              8
2              3     1            2           5         5660              6
3              3     1            3           6         5710              4
4              5     1            4           7         5700              3
5              5     1            5           1         5760              3
6              6     1            6           2         5720              4
7              4     1            7           3         5790              6
8              4     1            8           4         5790              3
9              6     1            9           5         5700              3
10             7     1           10           6         5700              3
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
1            20            NA      NA    5000               -15   30.56
2            NA            38      NA      NA               -14      NA
3            28            40      NA    2693               -25   47.66
4            37            45      NA     590               -24   55.04
5            51            54   45.32    1450                25   57.02
6            69            35   49.64    1568                15   53.78
7            19            45   46.40    2631               -33   54.14
8            25            55   52.70     554               -28   64.76
9            73            41   48.02    2083                23   52.52
10           59            44      NA    2654                -2   48.38
   Visibility_LAX
1             200
2             300
3             250
4             100
5              60
6              60
7             100
8             250
9             120
10            120
Code
head(ozone_date2, n=10)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1              3     1            1           4         5480              8
2              3     1            2           5         5660              6
3              3     1            3           6         5710              4
4              5     1            4           7         5700              3
5              5     1            5           1         5760              3
6              6     1            6           2         5720              4
7              4     1            7           3         5790              6
8              4     1            8           4         5790              3
9              6     1            9           5         5700              3
10             7     1           10           6         5700              3
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
1            20            NA      NA    5000               -15   30.56
2            NA            38      NA      NA               -14      NA
3            28            40      NA    2693               -25   47.66
4            37            45      NA     590               -24   55.04
5            51            54   45.32    1450                25   57.02
6            69            35   49.64    1568                15   53.78
7            19            45   46.40    2631               -33   54.14
8            25            55   52.70     554               -28   64.76
9            73            41   48.02    2083                23   52.52
10           59            44      NA    2654                -2   48.38
   Visibility_LAX       Date
1             200 1976-01-01
2             300 1976-01-02
3             250 1976-01-03
4             100 1976-01-04
5              60 1976-01-05
6              60 1976-01-06
7             100 1976-01-07
8             250 1976-01-08
9             120 1976-01-09
10            120 1976-01-10

Features Legend:

  • Ozone_reading: Daily maximum one-hour-average ozone reading
  • Month: 1 = January, …, 12 = December
  • Day_of_month: 1-30/31
  • Day_of_week: 1 = Monday, …, 7 = Sunday
  • Pressure_afb: 500 millibar pressure height (m) measured at Vandenberg AFB
  • Wind_speed_LAX: Wind speed (mph) at Los Angeles International Airport (LAX)
  • Humidity_LAX: Humidity (%) at LAX
  • Temp_sandburg: Temperature (degrees F) measured at Sandburg, CA
  • Temp_EM: Temperature (degrees F) measured at El Monte, CA
  • IBH_LAX: Inversion base height (feet) at LAX
  • Pressure_gradient: Pressure gradient (mm Hg) from LAX to Daggett, CA
  • IBT_LAX: Inversion base temperature (degrees F) at LAX
  • Visibility_LAX: Visibility (miles) measured at LAX

Data Exploration

Summary Statistics by Day of Week

The only notable trend in ozone levels by day of the week was that mean and median ozone levels appear to be lower on day 4 of the week (Thursdays).

Code
# Summary statistics by day of the week 
ozone_summary_by_day <- ozone2 %>%
  group_by(Day_of_week) %>%
  summarize(
    mean_ozone = mean(Ozone_reading, na.rm = TRUE),
    median_ozone = median(Ozone_reading, na.rm = TRUE),
    max_ozone = max(Ozone_reading, na.rm = TRUE),
    min_ozone = min(Ozone_reading, na.rm = TRUE)
  )

print(ozone_summary_by_day)
# A tibble: 7 × 5
  Day_of_week mean_ozone median_ozone max_ozone min_ozone
  <fct>            <dbl>        <dbl>     <dbl>     <dbl>
1 1                11.6           9          38         2
2 2                12.6           9.5        34         2
3 3                12.6          12          33         2
4 4                 9.58          7          26         1
5 5                11.2           9          32         2
6 6                12.1          10          33         2
7 7                10.9          10          27         1

Summary Statistics by Month

Mean and median ozone levels were lowest in January, February, and March. They increase through the spring and summer months, peak in July, and gradually decline through the remainder of the year.

Code
ozone_summary_by_month <- ozone2 %>%
  group_by(Month) %>%
  summarize(
    mean_ozone = mean(Ozone_reading, na.rm = TRUE),
    median_ozone = median(Ozone_reading, na.rm = TRUE),
    max_ozone = max(Ozone_reading, na.rm = TRUE),
    min_ozone = min(Ozone_reading, na.rm = TRUE)
  )

print(ozone_summary_by_month)
# A tibble: 12 × 5
   Month mean_ozone median_ozone max_ozone min_ozone
   <fct>      <dbl>        <dbl>     <dbl>     <dbl>
 1 1           5.45          5          11         3
 2 2           7.14          6          23         2
 3 3           8.84          9          24         2
 4 4          10             9          22         3
 5 5          15.8          16          33         2
 6 6          17.1          17          30         2
 7 7          20.2          19          34        10
 8 8          17.5          16          38         2
 9 9          12.4          11.5        29         2
10 10         12.6          12          30         3
11 11          7.67          5.5        20         1
12 12          4.06          4           8         1

Examine how feature variables correlate with Ozone levels

Correlation coefficients were determined for each variable with respect to Ozone levels (response variable). A bar graph was used to summarize the correlation coefficients. The following correlations were noted:

  • Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg
  • Strong Negative Correlations: IBH_LAX and Visibility_LAX
Code
# Make sure data is in numeric form
ozone2[] <- lapply(ozone2, as.numeric)

# Calculate correlations with Ozone
corr_coeffs <- cor(ozone2, use = "complete.obs")['Ozone_reading', ]
corr_coeffs <- corr_coeffs[!names(corr_coeffs) %in% 'Ozone_reading']

# Create a data frame for plotting
corr_df <- data.frame(Variable = names(corr_coeffs), Correlation = corr_coeffs)

# Create the bar graph
ggplot(corr_df, aes(x = reorder(Variable, Correlation), y = Correlation)) +
  geom_bar(stat = 'identity',fill = "skyblue") +
  xlab('Variable') +
  ylab('Correlation Coefficient') +
  ggtitle('Correlation Between Variables and Daily Average Ozone Reading') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Explore Interactions with Positively Correlated Variables

Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg

The percentage of ozone readings appears to be normally distributed with respect to Pressure_afb, IBT_LAX, and Temp_sandburg. There appears to be some negative skew with respect to humidity, with a secondary mode at the lowest humidity bin (10-20). There are a lot of missing data points for Temp_EM (139 observations out of 366), as discussed in the initial data summary. This represents 38.0% of the observations for Temp_EM.

Code
## Humidity
# Create bins for humidity levels
ozone3 <- ozone2 %>%
  mutate(humidity_bin = cut(Humidity_LAX, breaks = seq(10, 100, by = 10), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data1 <- ozone3 %>%
  group_by(humidity_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data1, aes(x = humidity_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Humidity Levels",
       x = "Humidity Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Code
## Pressure
# Create bins for pressure levels
ozone3 <- ozone2 %>%
  mutate(pressure_bin = cut(Pressure_afb, breaks = seq(5300, 6000, by = 100), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data2 <- ozone3 %>%
  group_by(pressure_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data2, aes(x = pressure_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Pressure Levels",
       x = "Pressure Levels",
       y = "Percentage of Ozone Readings") +
  theme(axis.text.x = element_text(angle = 45, vjust=0.5, size=8))

Code
## IBT - Inversion base temperature (degrees F) at LAX
# Create bins for Inversion base temp levels
ozone3 <- ozone2 %>%
  mutate(IBT_bin = cut(IBT_LAX, breaks = seq(20, 100, by = 10), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data3 <- ozone3 %>%
  group_by(IBT_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data3, aes(x = IBT_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Inversion base temp Levels",
       x = "Inversion base temp Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Code
## Temp_EM - Temperature (degrees F) measured at El Monte, CA
# Create bins for temp levels
ozone3 <- ozone2 %>%
  mutate(Temp_EM_bin = cut(Temp_EM, breaks = seq(20, 100, by = 10), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data4 <- ozone3 %>%
  group_by(Temp_EM_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data4, aes(x = Temp_EM_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Temp at El Monte Levels",
       x = "Temp Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Code
## Temp_sandburg
# Create bins for temp levels
ozone3 <- ozone2 %>%
  mutate(Temp_sd_bin = cut(Temp_sandburg, breaks = seq(20, 100, by = 10), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data5 <- ozone3 %>%
  group_by(Temp_sd_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data5, aes(x = Temp_sd_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Temp at Sandburg Levels",
       x = "Temp Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Explore Interactions with Negatively Correlated Variables

  • Strong Negative Correlations: IBH_LAX and Visibility_LAX

The percentage of ozone readings does not appear to be normally distributed with respect to either variable. The highest proportion of ozone readings occurs at high IBH_LAX levels (4600-5100), with a secondary mode at low IBH_LAX levels (600-1100). The proportion of ozone readings is positively skewed with respect to Visibility_LAX, with the mode occurring between 50 and 100 and a secondary mode occurring between 250 and 300.
Code
## IBH_LAX - Inversion base height (feet) at LAX
# Create bins for IBH levels
ozone3 <- ozone2 %>%
  mutate(IBH_bin = cut(IBH_LAX, breaks = seq(100, 5500, by = 500), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data6 <- ozone3 %>%
  group_by(IBH_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data6, aes(x = IBH_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by IBH Levels",
       x = "IBH Levels",
       y = "Percentage of Ozone Readings") +
  theme(axis.text.x = element_text(angle = 45, vjust=0.5))

Code
## Visibility
# Create bins for Visibility levels
ozone3 <- ozone2 %>%
  mutate(visibility_bin = cut(Visibility_LAX, breaks = seq(0, 500, by = 50), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data7 <- ozone3 %>%
  group_by(visibility_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data7, aes(x = visibility_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Visibility Levels",
       x = "Visibility Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Correlation Matrix for All Columns

The correlation matrix shows multiple strong correlations between the predictors in the model, which suggests high collinearity between variables. Some of the strongest correlations (|correlation| > 0.5) are as follows:

  • Pressure_afb: Temp_Sandburg, Temp_EM, IBH_LAX, IBT_LAX
  • Humidity_LAX: Pressure gradient
  • Temp_Sandburg: Pressure_afb, Temp_EM, IBH_LAX, IBT_LAX
  • Temp_EM: Pressure_afb, Temp_Sandburg, IBH_LAX, IBT_LAX
  • IBH_LAX: Pressure_afb, Temp_Sandburg, Temp_EM, IBT_LAX
  • Pressure_gradient: Humidity_LAX
  • IBT_LAX: Pressure_afb, Temp_Sandburg, Temp_EM, IBH_LAX
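One way to quantify the collinearity suggested by these pairs is the variance inflation factor (VIF). The sketch below is a generic base-R illustration on toy data (the helper name `vif_for` and the toy columns are ours, not part of this analysis): each predictor is regressed on the others, and VIF = 1/(1 - R²).

```r
# Hypothetical helper: VIF for one column of a numeric data frame
vif_for <- function(df, col) {
  fit <- lm(reformulate(setdiff(names(df), col), response = col), data = df)
  1 / (1 - summary(fit)$r.squared)
}

# Toy data: x2 is nearly a copy of x1, so both are highly collinear
set.seed(1)
x1 <- rnorm(100)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(100, sd = 0.1), x3 = rnorm(100))

vifs <- sapply(names(toy), function(v) vif_for(toy, v))
round(vifs, 1)  # x1 and x2 far above 1; x3 close to 1
```

VIFs well above 5-10 flag predictors that are largely explained by the others, consistent with the correlation pairs listed above.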
Code
library("corrplot")

#Delete NA values to allow calculation of correlation coefficients
ozone_delete_NA <- na.omit(ozone2)


#Make the correlation matrix
ozone_cor = cor(ozone_delete_NA)
ozone_cor
                  Ozone_reading        Month Day_of_month   Day_of_week
Ozone_reading        1.00000000  0.042423530  0.085732669 -3.551708e-02
Month                0.04242353  1.000000000  0.029780944 -6.406562e-03
Day_of_month         0.08573267  0.029780944  1.000000000  3.418381e-03
Day_of_week         -0.03551708 -0.006406562  0.003418381  1.000000e+00
Pressure_afb         0.59442132  0.337931827  0.158080640 -2.206218e-02
Wind_speed_LAX       0.08415020 -0.226893006 -0.046090839 -3.667633e-02
Humidity_LAX         0.48108120 -0.034727288 -0.064739863 -3.855381e-02
Temp_sandburg        0.77246956  0.235445072  0.157156363 -3.035349e-02
Temp_EM              0.75902225  0.314323892  0.117127229 -2.481044e-02
IBH_LAX             -0.55392523  0.045305170 -0.082352709  7.998485e-02
Pressure_gradient    0.17315141 -0.218837079 -0.111239793  3.418479e-02
IBT_LAX              0.71756579  0.236540625  0.127530054 -5.365959e-02
Visibility_LAX      -0.47604005 -0.167796386 -0.057896954 -8.572216e-06
                  Pressure_afb Wind_speed_LAX Humidity_LAX Temp_sandburg
Ozone_reading       0.59442132     0.08415020   0.48108120    0.77246956
Month               0.33793183    -0.22689301  -0.03472729    0.23544507
Day_of_month        0.15808064    -0.04609084  -0.06473986    0.15715636
Day_of_week        -0.02206218    -0.03667633  -0.03855381   -0.03035349
Pressure_afb        1.00000000    -0.23161673   0.03869121    0.80633038
Wind_speed_LAX     -0.23161673     1.00000000   0.30356343    0.04122208
Humidity_LAX        0.03869121     0.30356343   1.00000000    0.33132296
Temp_sandburg       0.80633038     0.04122208   0.33132296    1.00000000
Temp_EM             0.89689385    -0.06983510   0.21158607    0.91396229
IBH_LAX            -0.50891157     0.12834881  -0.24703914   -0.51539621
Pressure_gradient  -0.24549047     0.37328762   0.62433536    0.11765666
IBT_LAX             0.85642134    -0.12959891   0.19101936    0.84310310
Visibility_LAX     -0.34272720     0.04534341  -0.45750232   -0.41038641
                      Temp_EM     IBH_LAX Pressure_gradient     IBT_LAX
Ozone_reading      0.75902225 -0.55392523        0.17315141  0.71756579
Month              0.31432389  0.04530517       -0.21883708  0.23654062
Day_of_month       0.11712723 -0.08235271       -0.11123979  0.12753005
Day_of_week       -0.02481044  0.07998485        0.03418479 -0.05365959
Pressure_afb       0.89689385 -0.50891157       -0.24549047  0.85642134
Wind_speed_LAX    -0.06983510  0.12834881        0.37328762 -0.12959891
Humidity_LAX       0.21158607 -0.24703914        0.62433536  0.19101936
Temp_sandburg      0.91396229 -0.51539621        0.11765666  0.84310310
Temp_EM            1.00000000 -0.57965832       -0.12091597  0.93080989
IBH_LAX           -0.57965832  1.00000000        0.11350236 -0.78286145
Pressure_gradient -0.12091597  0.11350236        1.00000000 -0.20663872
IBT_LAX            0.93080989 -0.78286145       -0.20663872  1.00000000
Visibility_LAX    -0.43897902  0.39669789       -0.12005488 -0.43771768
                  Visibility_LAX
Ozone_reading      -4.760401e-01
Month              -1.677964e-01
Day_of_month       -5.789695e-02
Day_of_week        -8.572216e-06
Pressure_afb       -3.427272e-01
Wind_speed_LAX      4.534341e-02
Humidity_LAX       -4.575023e-01
Temp_sandburg      -4.103864e-01
Temp_EM            -4.389790e-01
IBH_LAX             3.966979e-01
Pressure_gradient  -1.200549e-01
IBT_LAX            -4.377177e-01
Visibility_LAX      1.000000e+00
Code
#Make the plot of the correlation matrix
corrplot(ozone_cor)

Exploration of the Missing Data

The total number of missing data points for each column is shown below as a count and as a percentage. Most of the columns contain <5% missing values relative to the total values observed for that feature. Temp_EM contains 139 missing values, which is 38.0% of that feature's observations. The total number of missing values in the dataset is 203 out of 4,758 (13 columns times 366 observations). This represents 4.3% of the entire dataset.

Summarize missing values

Code
# Function to summarize missing values in a data frame
summarize_missing_values <- function(data) {
  data %>%
    summarize_all(~ sum(is.na(.))) %>%
    gather(key = "column", value = "missing_values") %>%
    mutate(missing_percentage = (missing_values / nrow(data)) * 100)
}

missing_summary <- summarize_missing_values(ozone2)

# Print the summary of missing values
print(missing_summary)
              column missing_values missing_percentage
1      Ozone_reading              5          1.3661202
2              Month              0          0.0000000
3       Day_of_month              0          0.0000000
4        Day_of_week              0          0.0000000
5       Pressure_afb             12          3.2786885
6     Wind_speed_LAX              0          0.0000000
7       Humidity_LAX             15          4.0983607
8      Temp_sandburg              2          0.5464481
9            Temp_EM            139         37.9781421
10           IBH_LAX             15          4.0983607
11 Pressure_gradient              1          0.2732240
12           IBT_LAX             14          3.8251366
13    Visibility_LAX              0          0.0000000
Code
print(sum(is.na(ozone2)))
[1] 203

Visualization of the Missing Data

Three different plots were generated to visualize the missing data.

The first plot uses the vis_miss() function. Since the rows are arranged by date, we can see that most of the data are missing in random clusters with respect to date. Temp_EM, which has the highest proportion of missing data (38%), appears to be missing in somewhat regular intervals. This was further explored in the next section.

Code
# Plot missing data pattern
#Shows what percentage of the data are missing from each column
vis_miss(ozone2)

The second plot uses the gg_miss_upset() function. This plot shows the number of missing values not only by each column, but by each possible combination of missing values (intersections). For example, the third bar in this figure shows that there are 9 rows that are missing values for IBT_LAX, Humidity_LAX, and IBH_LAX. As stated previously, the majority of the missing values belong to Temp_EM, with 127 rows missing values for only this feature. The next highest intersection is Pressure_afb_NA, which contains 10 rows of missing values.

Code
#This plot gives a visual of what combinations of NAs are present and how many there are for each
#set nsets to 8 since we have 8 columns with missing data
gg_miss_upset(ozone2, nsets=8)

The third plot was generated using the gg_miss_var() function. Again, Temp_EM has by far the largest number of missing points (139).

Code
#Another way to visualize number of missing rows per column
gg_miss_var(ozone2) + labs(y = "Number of missing values") + ylim(0, 150)

Test Data for Normality

Prior to testing the randomness of the data, it must be checked for normality in order to properly interpret the randomness results. The Shapiro-Wilk and the Anderson-Darling tests were performed. In addition, a histogram and QQ plot were rendered.

Code
# Extract the dependent variable
ozone_levels <- ozone2$Ozone_reading

# Remove NA values
ozone_levels <- na.omit(ozone_levels)

# Perform normality tests
library(nortest)
shapiro_test <- shapiro.test(ozone_levels)
ad_test <- ad.test(ozone_levels)

# Print the results
print(shapiro_test)

    Shapiro-Wilk normality test

data:  ozone_levels
W = 0.91044, p-value = 8.37e-14
Code
print(ad_test)

    Anderson-Darling normality test

data:  ozone_levels
A = 10.027, p-value < 2.2e-16
  • Null Hypothesis (H0): The data is normally distributed.
  • Alternative Hypothesis (H1): The data is not normally distributed.

Based on both p-values (8.37e-14 and <2.2e-16), the null hypothesis is rejected: the data are not normally distributed.

A histogram and QQ plot were also created

Code
# Histogram
ggplot(ozone2, aes(x = Ozone_reading)) + geom_histogram(binwidth = 5) + ggtitle("Histogram of Ozone")

Code
# Q-Q plot
qqnorm(ozone2$Ozone_reading)
qqline(ozone2$Ozone_reading)

The histogram and QQ plot further support the conclusion that the data follows a non-normal distribution.

Randomness Testing of Missing Data

Little’s MCAR test, the Hawkins test, and a non-parametric test were performed to analyze whether the missing data follow an MCAR pattern.

Code
# Perform Little's MCAR test
mcar_result <- mcar_test(ozone2)
print(mcar_result)
# A tibble: 1 × 4
  statistic    df  p.value missing.patterns
      <dbl> <dbl>    <dbl>            <int>
1      278.   134 4.10e-12               13
  • Null Hypothesis (H0): The data is missing completely at random (MCAR).
  • Alternative Hypothesis (H1): The data is not missing completely at random.

Based on the p-value (4.102052e-12), the null hypothesis is rejected at the 0.05 significance level. However, since the test assumes the data come from a multivariate normal distribution, and our data are non-normal, this result is unreliable.

Run Hawkins and Non-Parametric Tests

Code
library(dplyr)
explanatory = c("Temp_EM","Month", "Day_of_month","Day_of_week", "Pressure_afb", "Wind_speed_LAX", "Humidity_LAX", "Temp_sandburg", "IBH_LAX", 
                "Pressure_gradient", "IBT_LAX", "Visibility_LAX")
dependent = "Ozone_reading"

# Run randomness test
ozone2 %>%
  select(all_of(explanatory)) %>%
  MissMech::TestMCARNormality()
Call:
MissMech::TestMCARNormality(data = .)

Number of Patterns:  4 

Total number of cases used in the analysis:  354 

 Pattern(s) used:
          Temp_EM   Month   Day_of_month   Day_of_week   Pressure_afb
group.1        NA       1              1             1              1
group.2         1       1              1             1              1
group.3         1       1              1             1              1
group.4         1       1              1             1             NA
          Wind_speed_LAX   Humidity_LAX   Temp_sandburg   IBH_LAX
group.1                1              1               1         1
group.2                1              1               1         1
group.3                1             NA               1        NA
group.4                1              1               1         1
          Pressure_gradient   IBT_LAX   Visibility_LAX   Number of cases
group.1                   1         1                1               129
group.2                   1         1                1               206
group.3                   1        NA                1                 9
group.4                   1         1                1                10

    Test of normality and Homoscedasticity:
  -------------------------------------------

Hawkins Test:

    P-value for the Hawkins test of normality and homoscedasticity:  0.0113103 

    Either the test of multivariate normality or homoscedasticity (or both) is rejected.
    Provided that normality can be assumed, the hypothesis of MCAR is 
    rejected at 0.05 significance level. 

Non-Parametric Test:

    P-value for the non-parametric test of homoscedasticity:  0.2743229 

    Reject Normality at 0.05 significance level.
    There is not sufficient evidence to reject MCAR at 0.05 significance level.

Hawkins Test

  • Null Hypothesis (H0): The data is missing completely at random (MCAR).
  • Alternative Hypothesis (H1): The data is not missing completely at random.

Based on the p-value (0.0113103), the null hypothesis is rejected at the 0.05 significance level. However, since this test also assumes normality, its result should likewise be discounted.

Non-Parametric Test

  • Null Hypothesis (H0): The data is missing completely at random (MCAR).
  • Alternative Hypothesis (H1): The data is not missing completely at random.

Based on the p-value (0.2743229), there is not sufficient evidence to reject the null hypothesis at the 0.05 significance level. Because this test does not assume normality, we conclude that the missing data are consistent with MCAR.

Further Visualization of the Missing Data: Missingness Patterns

Focusing on Temp_EM since it is the column with the highest number of missing values, the data appear to be missing randomly with respect to most of the variables. With respect to the date columns, the pattern of missingness appears to be random with respect to day of the month. However, it is apparent that there is a high proportion of missing data on days 6 and 7 of the week (Saturdays and Sundays). This indicates that Temp_EM was likely not measured on the weekends throughout the year. The values for Temp_EM also appear to be missing randomly with respect to month, though there is a notable high frequency of missing data around the month of May.

For the other variables in the dataset, there does not appear to be any obvious patterns to the missing data. Most importantly, the frequency of missing data for Temp_EM does not appear to depend on the output variable (ozone reading). Therefore, we can most likely proceed to imputation with the assumption that our data are missing completely at random (MCAR).

Code
# Create gg_miss_fct plots with adjusted themes
p1 <- gg_miss_fct(ozone2, fct = Month) + 
  ggtitle("Missing Data by Month") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p2 <- gg_miss_fct(ozone2, fct = Day_of_month) + 
  ggtitle("Missing Data by Day of Month") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p3 <- gg_miss_fct(ozone2, fct = Day_of_week) + 
  ggtitle("Missing Data by Day of Week") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p4 <- gg_miss_fct(ozone2, fct = Ozone_reading) + 
  ggtitle("Missing Data by Ozone Reading") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p5 <- gg_miss_fct(ozone2, fct = Pressure_afb) + 
  ggtitle("Missing Data by Pressure (AFB)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p6 <- gg_miss_fct(ozone2, fct = Wind_speed_LAX) + 
  ggtitle("Missing Data by Wind Speed") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p7 <- gg_miss_fct(ozone2, fct = Humidity_LAX) + 
  ggtitle("Missing Data by Humidity (LAX)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p8 <- gg_miss_fct(ozone2, fct = Temp_sandburg) + 
  ggtitle("Missing Data by Temperature (Sandburg)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p9 <- gg_miss_fct(ozone2, fct = Temp_EM) + 
  ggtitle("Missing Data by Temperature (EM)") +
  theme(plot.title = element_text(size=8))

p10 <- gg_miss_fct(ozone2, fct = IBH_LAX) + 
  ggtitle("Missing Data by IBH_LAX") +
  theme(plot.title = element_text(size=8))

p11 <- gg_miss_fct(ozone2, fct = Pressure_gradient) + 
  ggtitle("Missing Data by Pressure Gradient") +
  theme(plot.title = element_text(size=8))

p12 <- gg_miss_fct(ozone2, fct = IBT_LAX) + 
  ggtitle("Missing Data by IBT_LAX") +
  theme(plot.title = element_text(size=8))

p13 <- gg_miss_fct(ozone2, fct = Visibility_LAX) + 
  ggtitle("Missing Data by Visibility_LAX") +
  theme(plot.title = element_text(size=8))

# Arrange the plots into grids with proper spacing
grid1 <- grid.arrange(p1, p2, p3, p4, nrow = 2)

Code
grid2 <- grid.arrange(p5, p6, p7, p8, nrow = 2)

Code
grid3 <- grid.arrange(p9, p10, p11, nrow = 2)

Code
grid4 <- grid.arrange(p12, p13, nrow = 1)

Missing Data Imputation

Prior to imputation, the dataset was split into training and testing datasets using a 75/25 split. Missing value imputations were performed on the training dataset.

Code
# Split the data into training and testing sets
set.seed(123)
ozone2_split <- initial_split(ozone2, prop = 0.75)
train_ozone2 <- training(ozone2_split)
test_ozone2 <- testing(ozone2_split)

head(train_ozone2)
    Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
179            13     6           27           7         5880              5
14              4     1           14           3         5780              6
195            13     7           13           2         5850              9
306             7    11            1           1         5860              3
118             6     4           27           2         5620              9
299            13    10           25           1         5820              5
    Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
179           43            90      NA     580                 9   87.26
14            19            54   56.12    5000               -44   56.30
195           72            77   69.44    3389                56   68.72
306           NA            66   65.12      NA               -32      NA
118           47            56   39.92    5000                43   38.30
299           71            48      NA    1899                21   62.06
    Visibility_LAX
179             80
14             200
195             80
306             60
118            140
299             40
Code
head(test_ozone2)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
3              3     1            3           6         5710              4
6              6     1            6           2         5720              4
8              4     1            8           4         5790              3
12             6     1           12           1         5720              3
15             4     1           15           4         5830              3
17             5     1           17           6         5840              5
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
3            28            40      NA    2693               -25   47.66
6            69            35   49.64    1568                15   53.78
8            25            55   52.70     554               -28   64.76
12           44            51   54.32     111                 9   63.14
15           19            58   62.24    1249               -53   75.74
17           19            64      NA    5000               -40   63.32
   Visibility_LAX
3             250
6              60
8             250
12            150
15            250
17            200

Missing Data Imputation Using Simple Methods

Missing data were handled by listwise deletion and by feature removal (column deletion) to serve as baseline comparisons for the imputation methods. In addition, missing values were imputed with the mean, median, and mode as a demonstration of simple imputation techniques. For feature removal, we set a missing-value threshold of 20%. After each method was applied, the data were checked again to confirm that no missing values remained. A training dataset was created for each method:

Feature selection (column deletion)

The Temp_EM feature column exceeded the 20% threshold and was therefore removed from the dataset; rows with any remaining missing values were then omitted.

Code
# Function to drop column if quantity of missing values is over the threshold
drop_na_columns <- function(data, threshold) {
  na_counts <- colSums(is.na(data))
  na_proportion <- na_counts / nrow(data)
  data <- data[, na_proportion <= threshold]
  return(data)
}
# Define threshold (e.g., 20% NA allowed)
threshold <- 0.20

# Drop columns based on the NA threshold for training dataset
dropCol_train <- drop_na_columns(train_ozone2, threshold) # the column Temp_EM gets dropped
dropCol_train <- data.frame(dropCol_train)
dropCol_train <- na.omit(dropCol_train)

head(dropCol_train)
    Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
179            13     6           27           7         5880              5
14              4     1           14           3         5780              6
195            13     7           13           2         5850              9
118             6     4           27           2         5620              9
299            13    10           25           1         5820              5
229             4     8           16           1         5780              7
    Humidity_LAX Temp_sandburg IBH_LAX Pressure_gradient IBT_LAX Visibility_LAX
179           43            90     580                 9   87.26             80
14            19            54    5000               -44   56.30            200
195           72            77    3389                56   68.72             80
118           47            56    5000                43   38.30            140
299           71            48    1899                21   62.06             40
229           70            66    5000                45   51.26            200
Code
# check to make sure no missing values remain
print(sum(is.na(dropCol_train)))
[1] 0
Code
# Drop columns based on the NA threshold for test dataset
dropCol_test <- drop_na_columns(test_ozone2, threshold) # the column Temp_EM gets dropped
dropCol_test <- data.frame(dropCol_test)
dropCol_test <- na.omit(dropCol_test)

head(dropCol_test)
Code
# check to make sure no missing values remain
print(sum(is.na(dropCol_test)))
[1] 0

Listwise deletion (row deletion)

Create a new dataset where all rows with missing values are omitted.

Code
# Drop all missing values in training set. Row deletion.
dropNA_train <- na.omit(train_ozone2)

head(dropNA_train)
    Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
14              4     1           14           3         5780              6
195            13     7           13           2         5850              9
118             6     4           27           2         5620              9
229             4     8           16           1         5780              7
244            23     8           31           2         5950              8
365             1    12           30           4         5550              4
    Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
14            19            54   56.12    5000               -44   56.30
195           72            77   69.44    3389                56   68.72
118           47            56   39.92    5000                43   38.30
229           70            66   51.44    5000                45   51.26
244           61            93   81.68     620                27   85.64
365           85            39   41.00    5000                 8   39.92
    Visibility_LAX
14             200
195             80
118            140
229            200
244             30
365            100
Code
# check to make sure no missing values remain
print(sum(is.na(dropNA_train)))
[1] 0
Code
# Drop all missing values in test set. Row deletion.
dropNA_test <- na.omit(test_ozone2)

head(dropNA_test)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
6              6     1            6           2         5720              4
8              4     1            8           4         5790              3
12             6     1           12           1         5720              3
15             4     1           15           4         5830              3
19             4     1           19           1         5680              5
21             4     1           21           3         5760              3
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
6            69            35   49.64    1568                15   53.78
8            25            55   52.70     554               -28   64.76
12           44            51   54.32     111                 9   63.14
15           19            58   62.24    1249               -53   75.74
19           73            52   56.48     393               -68   69.80
21           19            54   53.60    5000               -58   51.98
   Visibility_LAX
6              60
8             250
12            150
15            250
19             10
21            250
Code
# check to make sure no missing values remain
print(sum(is.na(dropNA_test)))
[1] 0

Mean imputation

Create a new dataset where all missing values are replaced with the mean of the feature column.

Code
# Function for mean imputation
mean_impute <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)
}

# impute on training set
mean_train <- apply(train_ozone2, 2, mean_impute)
mean_train <- data.frame(mean_train)

head(mean_train)
    Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
179            13     6           27           7         5880              5
14              4     1           14           3         5780              6
195            13     7           13           2         5850              9
306             7    11            1           1         5860              3
118             6     4           27           2         5620              9
299            13    10           25           1         5820              5
    Humidity_LAX Temp_sandburg  Temp_EM  IBH_LAX Pressure_gradient  IBT_LAX
179     43.00000            90 57.51663  580.000                 9 87.26000
14      19.00000            54 56.12000 5000.000               -44 56.30000
195     72.00000            77 69.44000 3389.000                56 68.72000
306     58.57955            66 65.12000 2659.303               -32 60.80204
118     47.00000            56 39.92000 5000.000                43 38.30000
299     71.00000            48 57.51663 1899.000                21 62.06000
    Visibility_LAX
179             80
14             200
195             80
306             60
118            140
299             40
Code
# check to make sure no missing values remain
print(sum(is.na(mean_train)))
[1] 0
Code
# impute on test set
mean_test <- apply(test_ozone2, 2, mean_impute)
mean_test <- data.frame(mean_test)

head(mean_test)
Code
# check to make sure no missing values remain
print(sum(is.na(mean_test)))
[1] 0

Median imputation

Create a new dataset where all missing values are replaced with the median of the feature column.

Code
# Function for median imputation
median_impute <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  return(x)
}

# impute on training set
median_train <- apply(train_ozone2, 2, median_impute)
median_train <- data.frame(median_train)

head(median_train)
    Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
179            13     6           27           7         5880              5
14              4     1           14           3         5780              6
195            13     7           13           2         5850              9
306             7    11            1           1         5860              3
118             6     4           27           2         5620              9
299            13    10           25           1         5820              5
    Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
179         43.0            90   57.38     580                 9   87.26
14          19.0            54   56.12    5000               -44   56.30
195         72.0            77   69.44    3389                56   68.72
306         65.5            66   65.12    2385               -32   62.06
118         47.0            56   39.92    5000                43   38.30
299         71.0            48   57.38    1899                21   62.06
    Visibility_LAX
179             80
14             200
195             80
306             60
118            140
299             40
Code
# check to make sure no missing values remain
print(sum(is.na(median_train)))
[1] 0
Code
# impute on test set
median_test <- apply(test_ozone2, 2, median_impute)
median_test <- data.frame(median_test)

head(median_test)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
3              3     1            3           6         5710              4
6              6     1            6           2         5720              4
8              4     1            8           4         5790              3
12             6     1           12           1         5720              3
15             4     1           15           4         5830              3
17             5     1           17           6         5840              5
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
3            28            40   56.12    2693               -25   47.66
6            69            35   49.64    1568                15   53.78
8            25            55   52.70     554               -28   64.76
12           44            51   54.32     111                 9   63.14
15           19            58   62.24    1249               -53   75.74
17           19            64   56.12    5000               -40   63.32
   Visibility_LAX
3             250
6              60
8             250
12            150
15            250
17            200
Code
# check to make sure no missing values remain
print(sum(is.na(median_test)))
[1] 0

Mode imputation

Create a new dataset where all missing values are replaced with the mode of the feature column.

Code
# Function for mode imputation (using the most common value)
mode_impute <- function(x) {
  mode_val <- as.numeric(names(sort(table(x), decreasing = TRUE)[1]))
  x[is.na(x)] <- mode_val
  return(x)
}

# impute on training set
mode_train <- apply(train_ozone2, 2, mode_impute)
mode_train <- data.frame(mode_train)

head(mode_train)
    Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
179            13     6           27           7         5880              5
14              4     1           14           3         5780              6
195            13     7           13           2         5850              9
306             7    11            1           1         5860              3
118             6     4           27           2         5620              9
299            13    10           25           1         5820              5
    Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
179           43            90   54.32     580                 9   87.26
14            19            54   56.12    5000               -44   56.30
195           72            77   69.44    3389                56   68.72
306           19            66   65.12    5000               -32   51.26
118           47            56   39.92    5000                43   38.30
299           71            48   54.32    1899                21   62.06
    Visibility_LAX
179             80
14             200
195             80
306             60
118            140
299             40
Code
# check to make sure no missing values remain
print(sum(is.na(mode_train)))
[1] 0
Code
# impute on test set
mode_test <- apply(test_ozone2, 2, mode_impute)
mode_test <- data.frame(mode_test)

head(mode_test)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
3              3     1            3           6         5710              4
6              6     1            6           2         5720              4
8              4     1            8           4         5790              3
12             6     1           12           1         5720              3
15             4     1           15           4         5830              3
17             5     1           17           6         5840              5
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
3            28            40   38.12    2693               -25   47.66
6            69            35   49.64    1568                15   53.78
8            25            55   52.70     554               -28   64.76
12           44            51   54.32     111                 9   63.14
15           19            58   62.24    1249               -53   75.74
17           19            64   38.12    5000               -40   63.32
   Visibility_LAX
3             250
6              60
8             250
12            150
15            250
17            200
Code
# check to make sure no missing values remain
print(sum(is.na(mode_test)))
[1] 0
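A quick sanity check of the mode function defined above (re-defined here so the snippet stands alone): in c(1, 2, 2, NA) the most frequent value is 2, so the NA is replaced with 2. Note that table() drops NAs by default, and ties are broken by whichever value the sort happens to place first.

```r
# Mode imputation helper, as defined above
mode_impute <- function(x) {
  mode_val <- as.numeric(names(sort(table(x), decreasing = TRUE)[1]))
  x[is.na(x)] <- mode_val
  return(x)
}

mode_impute(c(1, 2, 2, NA))  # 1 2 2 2
```

Mode imputation is most natural for categorical features; for continuous variables like Temp_EM, the mode is simply the most frequently recorded reading, which may sit far from the center of the distribution.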

Missing Data Imputation Using MICE (Multiple Imputation Method) and missForest (Machine Learning Method)

Next, the missing data were imputed using the MICE and missForest methods.

missForest

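The chunk that produced the mf_train and mf_test objects used in the model-fitting section is not shown in this excerpt. A minimal sketch of what a missForest imputation of the training and test sets would look like, consistent with how those objects are used later (defaults assumed for maxiter and ntree):

```r
library(missForest)

set.seed(123)
# missForest() returns a list; $ximp holds the completed data frame
mf_train <- missForest(train_ozone2)$ximp
mf_test  <- missForest(test_ozone2)$ximp

# check to make sure no missing values remain
print(sum(is.na(mf_train)) + sum(is.na(mf_test)))
```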
MICE

For determining the number of imputations, a rule of thumb is one imputation for every 1% of missing data. Since approximately 4% of the data in this dataset are missing, we used 4 imputations (m = 4). Predictive mean matching (pmm) was chosen as the imputation method because it is well suited to continuous variables such as temperature and wind speed. The number of iterations must be large enough for the algorithm to converge; the typical range is 5-20. The function was run at maxit = 5, 10, and 20, and the trace plots indicate convergence by maxit = 20, so 20 iterations were used for the final imputation (Buuren 2018).
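The rule of thumb for choosing m can be checked in a couple of lines. The sketch below uses a toy data frame with one missing cell out of 25 (4%, matching the missing rate reported for this dataset):

```r
# Toy frame: 1 missing cell out of 25 -> 4% missing
df <- data.frame(matrix(1:25, nrow = 5))
df[2, 3] <- NA

# Proportion of missing cells across the whole data frame
p_missing <- mean(is.na(df))                       # 0.04

# Rule of thumb: roughly one imputation per 1% of missing data
m_suggested <- max(1, ceiling(100 * p_missing))    # 4
```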

Code
mice_train2 <- mice(train_ozone2, method = "pmm", m = 4, maxit = 10, print=FALSE)
plot(mice_train2)

Code
mice_train3 <- mice(train_ozone2, method = "pmm", m = 4, maxit = 20, print=FALSE)
plot(mice_train3)

Code
# impute on training set
# extracts the completed datasets from the mice object
mice_train <- complete(mice_train3)

# Convert completed data to data frame
mice_train <- as.data.frame(mice_train)

head(mice_train)
    Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
179            13     6           27           7         5880              5
14              4     1           14           3         5780              6
195            13     7           13           2         5850              9
306             7    11            1           1         5860              3
118             6     4           27           2         5620              9
299            13    10           25           1         5820              5
    Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
179           43            90   76.46     580                 9   87.26
14            19            54   56.12    5000               -44   56.30
195           72            77   69.44    3389                56   68.72
306           19            66   65.12    2972               -32   66.92
118           47            56   39.92    5000                43   38.30
299           71            48   59.54    1899                21   62.06
    Visibility_LAX
179             80
14             200
195             80
306             60
118            140
299             40
Code
# check to make sure no missing values remain
print(sum(is.na(mice_train)))
[1] 0
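One caveat with the extraction above: with no action argument, complete() returns only the first of the m = 4 imputed datasets. If all four are needed, they can be stacked in long format, or a model can be fit within each imputation and the estimates pooled (standard mice API; model formula here is illustrative):

```r
library(mice)

# All m imputed datasets stacked long; the .imp column identifies the imputation
mice_train_long <- complete(mice_train3, action = "long")

# Or fit a model within each imputed dataset and pool the estimates
fit <- with(mice_train3, lm(Ozone_reading ~ Temp_sandburg + Humidity_LAX))
summary(pool(fit))
```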
Code
# impute on test set
mice_test <- mice(test_ozone2, method = "pmm", m = 4, maxit = 20, print=FALSE)
#check convergence
plot(mice_test)

Code
# extracts the completed datasets from the mice object
mice_test <- complete(mice_test)

# Convert completed data to data frame
mice_test <- as.data.frame(mice_test)

head(mice_test)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
3              3     1            3           6         5710              4
6              6     1            6           2         5720              4
8              4     1            8           4         5790              3
12             6     1           12           1         5720              3
15             4     1           15           4         5830              3
17             5     1           17           6         5840              5
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
3            28            40   49.64    2693               -25   47.66
6            69            35   49.64    1568                15   53.78
8            25            55   52.70     554               -28   64.76
12           44            51   54.32     111                 9   63.14
15           19            58   62.24    1249               -53   75.74
17           19            64   65.12    5000               -40   63.32
   Visibility_LAX
3             250
6              60
8             250
12            150
15            250
17            200
Code
# check to make sure no missing values remain
print(sum(is.na(mice_test)))
[1] 0

Model-Fitting of Imputed Datasets

Make Predictions Using Data Where Missing Values Were Deleted Using Listwise Deletion (Deletion of All Rows with Missing Data)

Models were fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

Code
# Function to calculate RMSE
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}
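As a quick sanity check of the helper (re-defined here so the snippet is self-contained): predictions (1, 2, 3) against actuals (1, 2, 5) differ only in the last element, giving RMSE sqrt(4/3):

```r
# Root mean squared error, as defined above
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}

rmse(c(1, 2, 3), c(1, 2, 5))  # 1.154701, i.e. sqrt(4/3)
```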

Random Forest

Code
# Random forest model
rf_dropNA <- randomForest(Ozone_reading ~ ., data = dropNA_train)

# make predictions
rf_dropNA_pred <- predict(rf_dropNA, newdata = dropNA_test)

# Plot variable importance
varImpPlot(rf_dropNA, main = "Variable Importance Plot for Random Forest Model")

Code
# calculate RMSE
rf_dropNA_rmse <- rmse(rf_dropNA_pred, dropNA_test$Ozone_reading)
cat("Random Forest RMSE:", rf_dropNA_rmse, "\n")
Random Forest RMSE: 4.510765 

KNN

Code
# KNN model
knn_dropNA <- train(Ozone_reading ~ ., data = dropNA_train, method = "knn")

# make predictions
knn_dropNA_pred <- predict(knn_dropNA, newdata = dropNA_test)

# Plot variable importance
vi <- varImp(knn_dropNA)
plot(vi)

Code
# calculate RMSE
knn_dropNA_rmse <- rmse(knn_dropNA_pred, dropNA_test$Ozone_reading)
cat("KNN RMSE:", knn_dropNA_rmse, "\n")
KNN RMSE: 6.180034 

Decision Tree

Code
# Decision tree model
tree_dropNA <- rpart(Ozone_reading ~ ., data = dropNA_train)

# make predictions
tree_dropNA_pred <- predict(tree_dropNA, newdata = dropNA_test)

# Plot variable importance
var_importance <- varImp(tree_dropNA)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
# calculate RMSE
tree_dropNA_rmse <- rmse(tree_dropNA_pred, dropNA_test$Ozone_reading)
cat("Decision Tree RMSE:", tree_dropNA_rmse, "\n")
Decision Tree RMSE: 5.122975 

Make Predictions Using Data Where Columns with >20% Missing Data Were Deleted (Temp_EM column)

Models were fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

Random Forest

Code
# Random forest model
rf_dropCol <- randomForest(Ozone_reading ~ ., data = dropCol_train)

# make predictions
rf_dropCol_pred <- predict(rf_dropCol, newdata = dropCol_test)

# Plot variable importance
varImpPlot(rf_dropCol, main = "Variable Importance Plot for Random Forest Model")

Code
# calculate RMSE
rf_dropCol_rmse <- rmse(rf_dropCol_pred, dropCol_test$Ozone_reading)
cat("Random Forest RMSE:", rf_dropCol_rmse, "\n")
Random Forest RMSE: 1.702823 

KNN

Code
# KNN model
knn_dropCol <- train(Ozone_reading ~ ., data = dropCol_train, method = "knn")

# make predictions
knn_dropCol_pred <- predict(knn_dropCol, newdata = dropCol_test)

# Plot variable importance
vi <- varImp(knn_dropCol)
plot(vi)

Code
# calculate RMSE
knn_dropCol_rmse <- rmse(knn_dropCol_pred, dropCol_test$Ozone_reading)
cat("KNN RMSE:", knn_dropCol_rmse, "\n")
KNN RMSE: 4.850616 

Decision Tree

Code
# Decision tree model
tree_dropCol <- rpart(Ozone_reading ~ ., data = dropCol_train)

# make predictions
tree_dropCol_pred <- predict(tree_dropCol, newdata = dropCol_test)

# Plot variable importance
var_importance <- varImp(tree_dropCol)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
# calculate RMSE
tree_dropCol_rmse <- rmse(tree_dropCol_pred, dropCol_test$Ozone_reading)
cat("Decision Tree RMSE:", tree_dropCol_rmse, "\n")
Decision Tree RMSE: 3.55154 

Make Predictions Using Data Where Missing Values were Imputed using the Mean

Models were fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

Random Forest

Code
# Random forest model
rf_mean <- randomForest(Ozone_reading ~ ., data = mean_train)

# make predictions
rf_mean_pred <- predict(rf_mean, newdata = mean_test)

# Plot variable importance
varImpPlot(rf_mean, main = "Variable Importance Plot for Random Forest Model")

Code
# calculate RMSE
rf_mean_rmse <- rmse(rf_mean_pred, mean_test$Ozone_reading)
cat("Random Forest RMSE:", rf_mean_rmse, "\n")
Random Forest RMSE: 4.937651 

KNN

Code
# KNN model
knn_mean <- train(Ozone_reading ~ ., data = mean_train, method = "knn")

# make predictions
knn_mean_pred <- predict(knn_mean, newdata = mean_test)

# Plot variable importance
vi <- varImp(knn_mean)
plot(vi)

Code
# calculate RMSE
knn_mean_rmse <- rmse(knn_mean_pred, mean_test$Ozone_reading)
cat("KNN RMSE:", knn_mean_rmse, "\n")
KNN RMSE: 6.224075 

Decision Tree

Code
# Decision tree model
tree_mean <- rpart(Ozone_reading ~ ., data = mean_train)

# make predictions
tree_mean_pred <- predict(tree_mean, newdata = mean_test)

# Plot variable importance
var_importance <- varImp(tree_mean)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
# calculate RMSE
tree_mean_rmse <- rmse(tree_mean_pred, mean_test$Ozone_reading)
cat("Decision Tree RMSE:", tree_mean_rmse, "\n")
Decision Tree RMSE: 5.211102 

Make Predictions Using Data Where Missing Values were Imputed using the Median

Models were fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

Random Forest

Code
# Random forest model
rf_med <- randomForest(Ozone_reading ~ ., data = median_train)
rf_med_pred <- predict(rf_med, newdata = median_test)

# Plot variable importance
varImpPlot(rf_med, main = "Variable Importance Plot for Random Forest Model")

Code
# calculate RMSE
rf_med_rmse <- rmse(rf_med_pred, median_test$Ozone_reading)
cat("Random Forest RMSE:", rf_med_rmse, "\n")
Random Forest RMSE: 5.083853 

KNN

Code
# KNN model
knn_med <- train(Ozone_reading ~ ., data = median_train, method = "knn")
knn_med_pred <- predict(knn_med, newdata = median_test)

# Plot variable importance
vi <- varImp(knn_med)
plot(vi)

Code
# calculate RMSE
knn_med_rmse <- rmse(knn_med_pred, median_test$Ozone_reading)
cat("KNN RMSE:", knn_med_rmse, "\n")
KNN RMSE: 6.413804 

Decision Tree

Code
# Decision tree model
tree_med <- rpart(Ozone_reading ~ ., data = median_train)
tree_med_pred <- predict(tree_med, newdata = median_test)

# Plot variable importance
var_importance <- varImp(tree_med)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
# calculate RMSE
tree_med_rmse <- rmse(tree_med_pred, median_test$Ozone_reading)
cat("Decision Tree RMSE:", tree_med_rmse, "\n")
Decision Tree RMSE: 5.245885 

Make Predictions Using Data Where Missing Values were Imputed using the Mode

Models were fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

Random Forest

Code
# Random forest model
rf_mode <- randomForest(Ozone_reading ~ ., data = mode_train)
rf_mode_pred <- predict(rf_mode, newdata = mode_test)

# Plot variable importance
varImpPlot(rf_mode, main = "Variable Importance Plot for Random Forest Model")

Code
# calculate RMSE
rf_mode_rmse <- rmse(rf_mode_pred, mode_test$Ozone_reading)
cat("Random Forest RMSE:", rf_mode_rmse, "\n")
Random Forest RMSE: 5.382862 

KNN

Code
# KNN model
knn_mode <- train(Ozone_reading ~ ., data = mode_train, method = "knn")
knn_mode_pred <- predict(knn_mode, newdata = mode_test)

# Plot variable importance
vi <- varImp(knn_mode)
plot(vi)

Code
# calculate RMSE
knn_mode_rmse <- rmse(knn_mode_pred, mode_test$Ozone_reading)
cat("KNN RMSE:", knn_mode_rmse, "\n")
KNN RMSE: 6.3718 

Decision Tree

Code
# Decision tree model
tree_mode <- rpart(Ozone_reading ~ ., data = mode_train)
tree_mode_pred <- predict(tree_mode, newdata = mode_test)

# Plot variable importance
var_importance <- varImp(tree_mode)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
# calculate RMSE
tree_mode_rmse <- rmse(tree_mode_pred, mode_test$Ozone_reading)
cat("Decision Tree RMSE:", tree_mode_rmse, "\n")
Decision Tree RMSE: 6.559783 

Make Predictions Using Data Where Missing Values Were Imputed with Complex Methods

Make Predictions Using Data Where Missing Values were Imputed using missForest Method

Models were fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

Random Forest

Code
# Random forest model
rf_miss <- randomForest(Ozone_reading ~ ., data = mf_train)
rf_miss_pred <- predict(rf_miss, newdata = mf_test)

# Plot variable importance
varImpPlot(rf_miss, main = "Variable Importance Plot for Random Forest Model")

Code
# calculate RMSE
rf_miss_rmse <- rmse(rf_miss_pred, mf_test$Ozone_reading)
cat("Random Forest RMSE:", rf_miss_rmse, "\n")
Random Forest RMSE: 4.536337 

KNN

Code
# KNN model
knn_miss <- train(Ozone_reading ~ ., data = mf_train, method = "knn")
knn_miss_pred <- predict(knn_miss, newdata = mf_test)

# Plot variable importance
vi <- varImp(knn_miss)
plot(vi)

Code
# calculate RMSE
knn_miss_rmse <- rmse(knn_miss_pred, mf_test$Ozone_reading)
cat("KNN RMSE:", knn_miss_rmse, "\n")
KNN RMSE: 6.15751 

Decision Tree

Code
# Decision tree model
tree_miss <- rpart(Ozone_reading ~ ., data = mf_train)
tree_miss_pred <- predict(tree_miss, newdata = mf_test)

# Plot variable importance
var_importance <- varImp(tree_miss)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
# calculate RMSE
tree_miss_rmse <- rmse(tree_miss_pred, mf_test$Ozone_reading)
cat("Decision Tree RMSE:", tree_miss_rmse, "\n")
Decision Tree RMSE: 5.618096 

Make Predictions Using Data Where Missing Values were Imputed using MICE Method

Models were fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.
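Similarly, the MICE-imputed datasets (mice_train, mice_test) can be produced along these lines. This is a sketch under stated assumptions: m = 5 imputations, default per-column imputation methods, the seed value, and the choice to carry forward only the first completed dataset are all illustrative choices, and the train/test split step is omitted.

```r
library(mlbench)  # provides the Ozone dataset (with missing values)
library(mice)     # multiple imputation by chained equations

data(Ozone)

# Build m = 5 completed datasets; mice() picks a default imputation
# method per column type (e.g. pmm for numeric, polyreg for factors)
set.seed(42)
imp <- mice(Ozone, m = 5, printFlag = FALSE)

# Extract one completed dataset for downstream model training
mice_complete <- complete(imp, 1)
```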

Random Forest

Code
# Random forest model
rf_mice <- randomForest(Ozone_reading ~ ., data = mice_train)
rf_mice_pred <- predict(rf_mice, newdata = mice_test)

# Plot variable importance
varImpPlot(rf_mice, main = "Variable Importance Plot for Random Forest Model")

Code
# calculate RMSE
rf_mice_rmse <- rmse(rf_mice_pred, mice_test$Ozone_reading)
cat("Random Forest RMSE:", rf_mice_rmse, "\n")
Random Forest RMSE: 4.573042 

KNN

Code
# KNN model
knn_mice <- train(Ozone_reading ~ ., data = mice_train, method = "knn")
knn_mice_pred <- predict(knn_mice, newdata = mice_test)

# Plot variable importance
vi <- varImp(knn_mice)
plot(vi)

Code
# calculate RMSE
knn_mice_rmse <- rmse(knn_mice_pred, mice_test$Ozone_reading)
cat("KNN RMSE:", knn_mice_rmse, "\n")
KNN RMSE: 6.428537 

Decision Tree

Code
# Decision tree model
tree_mice <- rpart(Ozone_reading ~ ., data = mice_train)
tree_mice_pred <- predict(tree_mice, newdata = mice_test)

# Plot variable importance
var_importance <- varImp(tree_mice)
barplot(var_importance$Overall, names.arg = rownames(var_importance),
        main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
# calculate RMSE
tree_mice_rmse <- rmse(tree_mice_pred, mice_test$Ozone_reading)
cat("Decision Tree RMSE:", tree_mice_rmse, "\n")
Decision Tree RMSE: 5.572477 

Summarize Model Performance for Each Imputation Methodology

The RMSE for each model and imputation method combination was summarized into a data frame. The results are shown below. The best model fit (lowest RMSE) was obtained when using feature selection (column deletion) for handling missing data and the random forest algorithm for model training. Column deletion with the decision tree model was the second-best combination, followed by listwise deletion (DropNA) with random forest.

Code
# Label columns must align with the order of the scores vector:
# the scores cycle through (RF, KNN, Tree) within each imputation
# method, so repeat the labels explicitly rather than relying on
# R's silent vector recycling
models <- rep(c('RandomForest', 'KNN', 'DecisionTree'), times = 7)
scores <- c(rf_dropCol_rmse, knn_dropCol_rmse, tree_dropCol_rmse,
            rf_dropNA_rmse, knn_dropNA_rmse, tree_dropNA_rmse,
            rf_mean_rmse, knn_mean_rmse, tree_mean_rmse,
            rf_med_rmse, knn_med_rmse, tree_med_rmse,
            rf_mode_rmse, knn_mode_rmse, tree_mode_rmse,
            rf_miss_rmse, knn_miss_rmse, tree_miss_rmse,
            rf_mice_rmse, knn_mice_rmse, tree_mice_rmse
)
ImpMethod <- rep(c('DropCol', 'DropNA', 'Mean', 'Median', 'Mode', 'missForest', 'MICE'),
                 each = 3)

# Create dataframe and print results sorted by RMSE (best first)
rmse_df <- data.frame(Model = models, ImpMethod = ImpMethod, RMSE = scores)
print(rmse_df[order(rmse_df$RMSE), ])
          Model  ImpMethod     RMSE
1  RandomForest    DropCol 1.702823
3  DecisionTree    DropCol 3.551540
4  RandomForest     DropNA 4.510765
16 RandomForest missForest 4.536337
19 RandomForest       MICE 4.573042
2           KNN    DropCol 4.850616
7  RandomForest       Mean 4.937651
10 RandomForest     Median 5.083853
6  DecisionTree     DropNA 5.122975
9  DecisionTree       Mean 5.211102
12 DecisionTree     Median 5.245885
13 RandomForest       Mode 5.382862
21 DecisionTree       MICE 5.572477
18 DecisionTree missForest 5.618096
17          KNN missForest 6.157510
5           KNN     DropNA 6.180034
8           KNN       Mean 6.224075
14          KNN       Mode 6.371800
11          KNN     Median 6.413804
20          KNN       MICE 6.428537
15 DecisionTree       Mode 6.559783
Code
rmse_df$Combined_Methods <- paste(rmse_df$ImpMethod, rmse_df$Model)

library(forcats)
library(ggplot2)
library(dplyr)

# Bar chart of RMSE by imputation method / model combination,
# ordered so the best (lowest RMSE) combination appears at the top
rmse_df %>%
  mutate(Combined_Methods = fct_reorder(Combined_Methods, desc(RMSE))) %>%
  ggplot(aes(x = Combined_Methods, y = RMSE)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  xlab("") +
  theme_bw()

Conclusions

In summary, managing missing data is essential to maintaining the integrity and dependability of data analysis. This work demonstrates various methodologies for handling missing data, including deletion methods (listwise deletion and feature/column deletion), single imputation techniques (mean, median, and mode imputation), machine learning-based algorithms such as missForest, and multiple imputation using MICE. The “Ozone” dataset from the mlbench package in R was used to demonstrate these techniques. Statistical tests and visualization techniques were also used to help classify data as MCAR, MAR, or NMAR. Missing data were deleted or imputed using the seven methodologies described above, and each resulting dataset was used to train three machine learning models: decision tree, random forest, and KNN.

For the Ozone dataset, the data were found to be MCAR using the non-parametric test provided in the MissMech package. The random forest model, in combination with feature selection (column deletion) for handling missing data, resulted in the lowest RMSE on the unseen test dataset. This combination may have performed best for a few reasons. First, the Temp_EM column contains the majority of the missing data in this dataset, with 38% of its values missing. Second, there was a high degree of collinearity (|r| > 0.5) between several of the model features, including Temp_EM. Third, the data appear to be MCAR based on statistical testing, so deletion of missing data should not introduce significant bias.
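The MCAR test referenced here comes from the MissMech package. A minimal sketch of that check, applied to the raw Ozone data before the renaming used in this analysis: only the numeric columns are passed, since the test expects continuous variables, and the choice of columns 4 through 13 reflects the three leading factor columns in mlbench's version of the data.

```r
library(mlbench)
library(MissMech)  # non-parametric test of the MCAR assumption

data(Ozone)

# TestMCARNormality expects continuous variables, so drop the three
# factor columns (month, day of month, day of week) before testing;
# a non-significant result is consistent with data being MCAR
TestMCARNormality(data = Ozone[, 4:13])
```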

There is no single “best” method for handling missing data. It is best practice to test a variety of deletion and imputation methods, as was done in this work, and to pre-define success criteria before handling missing data. To produce reliable and accurate analysis results, the imputation method selected should ultimately align with the unique features of the dataset and the analytical objectives.

References

Awan, Bennamoun, S. E. 2022. “A Reinforcement Learning-Based Approach for Imputing Missing Data.” Neural Computing and Applications 34 (12): 9701–16.
Azur, Melissa J, Elizabeth A Stuart, Constantine Frangakis, and Philip J Leaf. 2011. “Multiple Imputation by Chained Equations: What Is It and How Does It Work?” International Journal of Methods in Psychiatric Research 20 (1): 40–49.
Batista, Gustavo EAPA, and Maria Carolina Monard. 2003. “An Analysis of Four Missing Data Treatment Methods for Supervised Learning.” Applied Artificial Intelligence 17 (5-6): 519–33.
Bennett, Derrick A. 2001. “How Can I Deal with Missing Data in My Study?” Australian and New Zealand Journal of Public Health 25 (5): 464–69.
Buuren, S. van. 2018. Flexible Imputation of Missing Data. CRC Press. https://books.google.com/books/about/Flexible_Imputation_of_Missing_Data_Seco.html?id=lzb3DwAAQBAJ&source=kp_book_description.
Du, Jinghan, Minghua Hu, and Weining Zhang. 2020. “Missing Data Problem in the Monitoring System: A Review.” IEEE Sensors Journal 20 (23): 13984–98.
Emmanuel, Maupong, T. 2021. “A Survey on Missing Data in Machine Learning.” J Big Data 8: 401–7.
Ghazali, Shamihah Muhammad, Norshahida Shaadan, and Zainura Idrus. 2020. “Missing Data Exploration in Air Quality Data Set Using R-Package Data Visualisation Tools.” Bulletin of Electrical Engineering and Informatics 9 (2): 755–63.
Guzel, Kaya, C. 2013. “Breast Cancer Diagnosis Based on Naïve Bayes Machine Learning Classifier with KNN Missing Data Imputation.” AWERProcedia Information Technology & Computer Science 4: 332–46.
Ibrahim, Chen, J. G. 2011. “Missing-Data Methods for Generalized Linear Models.” Journal of the American Statistical Association 100 (469).
Jerez, Molina, J. M. 2010. “Missing Data Imputation Using Statistical and Machine Learning Methods in a Real Breast Cancer Problem.” Artificial Intelligence in Medicine 50 (2): 105–15.
Li, Cheng. 2013. “Little’s Test of Missing Completely at Random.” The Stata Journal 13 (4): 795–809.
Lin, Wei-Chao, and Chih-Fong Tsai. 2020. “Missing Value Imputation: A Review and Analysis of the Literature (2006–2017).” Artificial Intelligence Review 53: 1487–1509.
Little, Todd D, Terrence D Jorgensen, Kyle M Lang, and E Whitney G Moore. 2014. “On the Joys of Missing Data.” Journal of Pediatric Psychology 39 (2): 151–62.
Mack, Christina, Zhaohui Su, and Daniel Westreich. 2018. “Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide.”
May, Michael, Christine Körner, Dirk Hecker, Martial Pasquier, Urs Hofmann, and Felix Mende. 2009. “Modelling Missing Values for Audience Measurement in Outdoor Advertising Using GPS Data.” In GI Jahrestagung, 3993–4006.
Ren, Lijuan, Tao Wang, Aicha Sekhari Seklouli, Haiqing Zhang, and Abdelaziz Bouras. 2023. “A Review on Missing Values for Main Challenges and Methods.” Information Systems, 102268.
Rubin, Donald B. 2004. Multiple Imputation for Nonresponse in Surveys. Vol. 81. John Wiley & Sons.
Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. CRC press.
Schefer, J. 2002. “Dealing with Missing Data.” Research Letters in the Information and Mathematical Sciences 3 (1): 153–60.
Tang, Fei. 2017. Random Forest Missing Data Approaches. University of Miami.
Zhang, Sun, X. 2024. “A Matrix Completion Method for Imputing Missing Values of Process Data.” Processes 12 (4): 659.
Zheng, Siming, Juan Zhang, and Yong Zhou. 2023. “Likelihood Identifiability and Parameter Estimation with Nonignorable Missing Data.” Canadian Journal of Statistics 51 (4): 1190–1209.